Data Locality Optimization Strategies for AMR Applications on GPU-accelerated Supercomputers
Abstract
As the memory hierarchies of supercomputers become more complex, improving the performance of applications bound by data movement throughput becomes challenging. For example, Tokyo Tech's TSUBAME, the machine we intend to use, includes several data transfer bottlenecks that complicate domain decomposition and load balancing: the inter-node interconnect, intra-node multi-GPU connections, and the GPU memory hierarchy. Our main research target is to study the impact of hierarchical memory on the optimization strategies used in specific classes of memory-bound applications. Further, we are currently expanding our study to investigate the data movement challenge for Adaptive Mesh Refinement (AMR) applications. The resources we apply for in this proposal are to be used specifically for AMR applications, to analyze the scalability of data locality optimizations that specialize the computation on the nearest memory level. Applying those optimizations can improve the performance of AMR applications on systems with complex memory hierarchies such as TSUBAME. The AMR method is widely used by a diversity of scientific applications. AMR methods are complex and suffer from bottlenecks at different levels of data movement. We are motivated by the challenges that changes in the memory hierarchy impose on AMR applications. In our experiments, we demonstrated the potential of a data-centric approach for AMR applications. Further, we were able to demonstrate that our data locality approach for AMR can scale to 3,640 GPUs for three real-world applications.

1. Basic Information

(1) Collaborating JHPCN Centers: Tokyo Tech (TSUBAME2.5)

(2) Research Areas
o Very large-scale numerical computation
o Very large-scale data processing
o Very large capacity network technology
o Very large-scale information systems

(3) Roles of Project Members: Mohamed Wahib, RIKEN AICS (PI); Naoya Maruyama, RIKEN AICS (Participant); Takayuki Aoki, Tokyo Tech (Participant)

2. Purpose and Significance of the Research

The main purpose of this work is to identify and optimize for data locality in AMR applications running on GPU-accelerated supercomputers. First, we devised a performance model for the data-centric AMR method. This provides a basis for guiding the problem decomposition and load balancing, if needed. Numerous performance models exist in HPC; however, to the authors' knowledge, analytical optimization for data locality in AMR, which influences problem decomposition, is not covered in the literature. Second, we optimized data-centric AMR implementations for real-world applications, including the phase-field simulation. Finally, we tested and evaluated the execution of real-world applications at the full scale of TSUBAME. This not only improved the performance of the tested applications; it also provided a realistic measure of the applicability of the framework to other applications implemented by other researchers.

3. Significance as a JHPCN Joint Research Project

Two factors motivate the application for the proposed work as a JHPCN joint research project: a human factor and a machine factor. For the human factor, the group under Prof. Aoki at Tokyo Tech has leading expertise in AMR simulations for scientific applications. Moreover, the same group did highly valued work on fixed-mesh phase-field simulations of 3D dendritic growth (2012 Gordon Bell Award). Introducing the data-centric AMR method to AMR applications, including the phase-field simulation, is an opportunity for the proposed study. For the machine factor, the data-centric AMR approach is designed mainly for GPUs. Japan's largest GPU-accelerated supercomputer, TSUBAME, is hosted at Tokyo Tech and is an ideal test bed for scalability and performance studies of GPU applications.

4. Outline of the Research Achievements up to FY 2015

We conducted experiments on a single multi-GPU node for two different AMR applications. The experiments involved testing different data locality approaches within our AMR framework. Promising results were achieved when a technique for data-centric computation eliminated the CPU-GPU data transfer bottleneck. More specifically, we utilized a specialization technique by which the CPU specializes in the operations touching the data structures used to manage the mesh (an octree in our framework), while the GPUs specialize in operations that touch the data arrays of the blocks. While this approach requires writing additional GPU CUDA kernels, it eliminates the need to transfer the blocks between the CPU and GPU every time the mesh is evaluated for changes. In comparison to baseline implementations that require the blocks to be transferred to the CPU, we reported speedups of up to 2.21x and 2.83x for data-centric AMR implementations of a hydrodynamics simulation and a shallow-water simulation, respectively [1][2]. Details of the results can be found in the publications listed in section 7.
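To make this specialization concrete, the following is a minimal CUDA sketch of the pattern, not the framework's actual API: the kernel name, the per-block cell count, and the gradient-based criterion are illustrative assumptions. The block data stays resident in GPU memory; the GPU evaluates a per-block refinement flag, and only the small flag array crosses the PCIe bus to the CPU, which owns and updates the octree.

// Illustrative sketch only: names and the gradient criterion are assumptions,
// not the framework's actual API.
#include <cuda_runtime.h>
#include <math.h>

#define CELLS_PER_BLOCK 4096   // cells stored per AMR block (assumed)

// One GPU thread per AMR block: scan the block's resident cell data,
// compute a simple gradient-based criterion, and set a refinement flag.
__global__ void refine_criterion_kernel(const double *block_data,
                                        int *refine_flags,
                                        int num_blocks,
                                        double threshold)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= num_blocks) return;

    const double *cells = block_data + (size_t)b * CELLS_PER_BLOCK;
    double max_grad = 0.0;
    for (int c = 0; c + 1 < CELLS_PER_BLOCK; ++c) {
        double g = fabs(cells[c + 1] - cells[c]);
        if (g > max_grad) max_grad = g;
    }
    refine_flags[b] = (max_grad > threshold) ? 1 : 0;
}

// Host side: the CPU owns the octree; the blocks themselves never leave
// the GPU when the mesh is evaluated for changes.
void evaluate_mesh(const double *d_block_data, int *d_flags, int *h_flags,
                   int num_blocks, double threshold)
{
    int threads = 128;
    int grid = (num_blocks + threads - 1) / threads;
    refine_criterion_kernel<<<grid, threads>>>(d_block_data, d_flags,
                                               num_blocks, threshold);
    // Transfer num_blocks ints instead of every block's cell arrays.
    cudaMemcpy(h_flags, d_flags, num_blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    // The CPU then walks its octree and refines/coarsens nodes whose
    // flags request it (octree update omitted here).
}

The point of the sketch is the direction of specialization: GPU kernels touch only block data arrays, the CPU touches only the mesh-management octree, so whole blocks are never copied back just to decide whether the mesh should change.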
5. Details of FY 2016 Research Achievements

We introduced a high-level programming framework, named Daino, that provides a highly productive programming environment for AMR. The framework is transparent and requires minimal involvement from the programmer, while generating efficient and scalable AMR code. The framework consists of compiler and runtime components. A set of directives allows the programmer to identify stencils of a uniform mesh in an architecture-neutral way. The uniform-mesh code is then translated to GPU-optimized parallel AMR code, which is then compiled to an executable. The runtime component encapsulates the AMR hierarchy and provides an interface for the mesh management operations. The framework is publicly available on GitHub: https://github.com/wahibium/Daino
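As an illustration of this programming model, the following is a minimal sketch of the kind of uniform-mesh stencil a user would write; the directive spelling (#pragma daino ...) is a hypothetical placeholder and not Daino's actual syntax. The translator is what turns such a loop nest into block-wise, GPU-optimized AMR kernels.

/* Illustrative only: the pragma below is a hypothetical placeholder,
 * not Daino's actual directive syntax. */
#define NX 512
#define NY 512

void diffuse_step(const float *u, float *u_new, float dt, float dx)
{
    /* Hypothetical directive marking this loop nest as a 5-point stencil
     * on a uniform mesh, to be translated into GPU AMR code. */
    /* #pragma daino stencil(in: u, out: u_new) halo(1) */
    for (int j = 1; j < NY - 1; ++j) {
        for (int i = 1; i < NX - 1; ++i) {
            int c = j * NX + i;
            u_new[c] = u[c] + dt / (dx * dx) *
                (u[c - 1] + u[c + 1] + u[c - NX] + u[c + NX] - 4.0f * u[c]);
        }
    }
}

The user-visible code stays architecture-neutral; mesh management, refinement, and data movement are generated behind the directive.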
We demonstrate the scalability of the auto-generated AMR code using three production applications, and we compare the speedup and scalability with hand-written AMR versions of all three:
o Phase-field simulation: simulates 3D dendritic growth during binary alloy solidification.
o Hydrodynamics solver: uses a second-order directionally split hyperbolic scheme to solve the Euler equations.
o Shallow-water: models shallow water by depth-averaging the Navier-Stokes equations.

In a weak scaling experiment, shown in Figure 1, the run-times of the uniform mesh, hand-written AMR, and auto-generated AMR implementations are compared. The following points are important to note. First, more than 1.7x speedup is achieved using Daino on the full TSUBAME machine, 3,640 GPUs, for the phase-field simulation. This is a considerable improvement considering that the uniform mesh implementation is a Gordon Bell prize winner for time-to-solution. Second, Daino achieves good scaling that is comparable to the scalability of the hand-written AMR code.

Figure 1: Weak scaling of uniform mesh, hand-written, and auto-generated AMR.